Skip to content

Reap fork children off the proxy loop#1816

Merged
mikasenghaas merged 1 commit into
mainfrom
codex/v1-reap-off-loop
Jun 23, 2026
Merged

Reap fork children off the proxy loop#1816
mikasenghaas merged 1 commit into
mainfrom
codex/v1-reap-off-loop

Conversation

@xeophon

@xeophon xeophon commented Jun 21, 2026

Copy link
Copy Markdown
Member

Overview

Move fork-child reaping and private-CWD deletion off the shared MCP proxy event loop while preserving the fork parent's single-threaded spawn boundary.

Why

A rollout close currently performs blocking waitpid(..., 0) and recursive rmtree(...) calls directly on the proxy loop. A metadata-heavy child directory can therefore pause heartbeats, timers, and requests to otherwise-ready rollout children for the full cleanup duration.

Design

  • Keep child lookup, removal, waiter notification, and SIGKILL ownership on the proxy loop.
  • Serialize cleanup with the existing fork lock so no replacement child can reuse a directory while it is being deleted and no fork() can run while a cleanup thread exists.
  • Run ordered waitpid and rmtree work in one short-lived executor worker, then join that worker before releasing the fork lock.
  • Let requests to already-ready children continue through the lock-free fast path.
  • Finish the cleanup future before propagating cancellation, so a removed child cannot leave its private directory behind.

Performance and resources

A standalone PEP 723 benchmark reaped one real child and deleted an identical tree of 20,000 one-byte files across 21 directories. Three paired macOS/APFS runs produced these medians:

Metric Synchronous Off-loop Change
Maximum proxy-loop gap 0.852173 s 0.001872 s 0.850301 s recovered (99.8%)
Cleanup wall time 0.851710 s 0.896938 s +0.045228 s (+5.3%)
Retained RSS delta 376,832 B 671,744 B +294,912 B (~0.281 MiB)
Peak traced Python allocation 137,918 B 228,641 B +90,723 B (~88.6 KiB)
Maximum asyncio tasks 2 2 unchanged
Active threads during cleanup 1 2 one temporary worker

The optimization targets loop responsiveness rather than deletion throughput: the filesystem work remains the same, and wall time varies with metadata/cache behavior. The temporary worker is fully joined before another child may be forked. The workload opens no network connections, copies no bytes, and removes the same child process, files, directories, and payload in both modes.


Note

Medium Risk
Changes fork/reap synchronization and teardown ordering in the Linux-only MCP multiplexer; mistakes could leak temp dirs or briefly block new forks during cleanup, but scope is limited to forked tool-server lifecycle.

Overview
Rollout teardown in the fork-per-rollout MCP multiplexer no longer runs blocking waitpid and shutil.rmtree on the shared proxy asyncio loop.

reap still removes the child, wakes waiters, and sends SIGKILL on the loop, but it now holds the existing forking lock while a single-thread executor performs ordered cleanup. The loop awaits that work (with asyncio.shield and completion on cancel) so directories are not deleted mid-fork() and private CWDs are not left behind if the reap task is cancelled.

Ready children keep using the lock-free fast path in ensure; only spawn/reap paths contend on the fork lock.

Reviewed by Cursor Bugbot for commit be789c5. Bugbot is set up for automated code reviews on this repo. Configure here.

Note

Reap fork children off the proxy event loop in serve_forked

  • The reap coroutine in multiplex.py now acquires the forking lock before reaping, serializing cleanup with concurrent fork operations.
  • Blocking waitpid and directory removal are offloaded to a ThreadPoolExecutor via loop.run_in_executor, keeping the event loop unblocked.
  • asyncio.shield wraps the cleanup future so that cancellation of the coroutine still awaits full cleanup before re-raising.

Macroscope summarized be789c5.

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 2cb75b6. Configure here.

Comment thread verifiers/v1/mcp/multiplex.py
@macroscopeapp

macroscopeapp Bot commented Jun 21, 2026

Copy link
Copy Markdown

Approvability

Verdict: Needs human review

Changes concurrency and lock semantics in fork/reap process management—introduces ThreadPoolExecutor for cleanup and extends async lock scope. These runtime behavior modifications to critical infrastructure warrant careful review for potential side effects.

You can customize Macroscope's approvability policy. Learn more.

@xeophon xeophon changed the base branch from feat/nano-as-v1 to main June 23, 2026 04:11
@xeophon xeophon force-pushed the codex/v1-reap-off-loop branch from a0c743e to be789c5 Compare June 23, 2026 04:17
@xeophon xeophon requested a review from mikasenghaas June 23, 2026 04:24
@mikasenghaas mikasenghaas merged commit d7ad39f into main Jun 23, 2026
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants